Fault Tolerance
💡 Definition
Fault tolerance is a system's ability to keep operating without interruption when one or more of its components fail. The goal is that the failure of an individual component never becomes a service outage.
🔑 Key Concepts
- Built-in Redundancy: Components are duplicated so that if one fails, a backup component can immediately take its place.
- Graceful Degradation: A system can continue to operate, perhaps with reduced performance or functionality, even when parts of it are failing.
- Automatic Recovery: Mechanisms to automatically detect and recover from failures without human intervention.
- Loose Coupling: Designing systems so components are independent, preventing a failure in one from cascading to others.
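The first two concepts can be illustrated with a minimal Python sketch. It is not an AWS API: `call_with_failover`, `get_recommendations`, and the `healthy` / `broken` replicas are hypothetical names invented for this example. The sketch assumes each replica is a plain callable that either returns a result or raises an exception.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Built-in redundancy: try each replica in order and return the first success."""
    failures = 0
    for replica in replicas:
        try:
            return replica()
        except Exception:
            # This replica failed; fall through to the next redundant copy.
            failures += 1
    raise AllReplicasFailed(f"{failures} replica(s) failed")


def get_recommendations(replicas: Sequence[Callable[[], str]],
                        fallback: str = "popular-items") -> str:
    """Graceful degradation: serve a simpler answer instead of an error."""
    try:
        return call_with_failover(replicas)
    except AllReplicasFailed:
        return fallback


# Hypothetical replicas used only for illustration.
def healthy() -> str:
    return "personalized-items"


def broken() -> str:
    raise ConnectionError("replica unreachable")
```

With `[broken, healthy]` the caller still gets the full result from the second replica; with `[broken, broken]` the caller gets the reduced `fallback` result rather than a failure.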
⚙️ How it Works
In AWS, fault tolerance is built into many services (e.g., S3 Standard stores each object redundantly across multiple Availability Zones, so it is highly durable by design). For applications, it is achieved by distributing resources across multiple Availability Zones (AZs), using services that recover automatically (such as Auto Scaling replacing unhealthy EC2 instances), and designing with redundancy.
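The automatic-recovery idea can be sketched as a small simulation. This is only an illustration of the pattern, not how Auto Scaling is implemented and not an AWS SDK call: `Instance`, `launch`, `reconcile`, and the AZ names are hypothetical. It assumes a fixed desired capacity and that replacements go to the AZ that currently has the fewest healthy instances.

```python
from dataclasses import dataclass
from itertools import count
from typing import List


@dataclass
class Instance:
    instance_id: str
    az: str
    healthy: bool = True


_ids = count(1)  # simple generator for made-up instance IDs


def launch(az: str) -> Instance:
    """Pretend to start a new instance in the given AZ."""
    return Instance(instance_id=f"i-{next(_ids):04d}", az=az)


def reconcile(fleet: List[Instance], azs: List[str], desired: int) -> List[Instance]:
    """One recovery pass: drop unhealthy instances, then refill to `desired`."""
    survivors = [inst for inst in fleet if inst.healthy]
    while len(survivors) < desired:
        # Place the replacement in the least-populated AZ to keep the fleet spread out.
        target = min(azs, key=lambda az: sum(inst.az == az for inst in survivors))
        survivors.append(launch(target))
    return survivors


azs = ["az-a", "az-b"]               # hypothetical AZ names
fleet = reconcile([], azs, desired=4)  # initial launch: spread across both AZs
fleet[0].healthy = False               # simulate a failed health check
fleet = reconcile(fleet, azs, desired=4)  # unhealthy instance is replaced
```

After the second pass the fleet is back at four healthy instances, still split evenly across the two AZs, without any manual step.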
🎯 Use Cases
- Mission-Critical Systems: Where any downtime is unacceptable (e.g., medical systems, emergency services).
- Data Durability: Ensuring data is not lost due to hardware failures (e.g., S3).
💰 Pricing Model
- Implementing fault tolerance often involves redundant resources, leading to higher costs than non-redundant setups.
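The cost trade-off can be made concrete with a rough calculation. The sketch below assumes that redundant copies fail independently, which is a simplification (failures inside one AZ can be correlated, which is one reason to spread copies across AZs). The figures are made up for illustration and are not AWS prices or SLA values; `combined_availability` and `monthly_cost` are hypothetical helper names.

```python
def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` independent components is up."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies must be down at once for the system to be down.
    return 1 - (1 - single) ** copies


def monthly_cost(unit_cost: float, copies: int) -> float:
    """Cost grows linearly with the number of redundant copies."""
    return unit_cost * copies


# Illustrative numbers only: 99% availability and 100 cost units per copy.
for n in (1, 2, 3):
    print(n, round(combined_availability(0.99, n), 6), monthly_cost(100.0, n))
```

Under these assumptions each extra copy adds the same cost but a smaller availability gain than the previous one, so the right amount of redundancy depends on how much downtime the workload can tolerate.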
📝 Exam Tips (CLF-C02)
- A key benefit of cloud computing.
- Often achieved through redundancy across AZs.
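- Closely related to High Availability, but stricter: high availability aims to minimize downtime and may allow a brief interruption during failover, while fault tolerance aims for no interruption at all when a component fails.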
See Also:
- High Availability
- AZ
- Shared Responsibility Model